Dataset statistics
| Number of variables | 5 |
|---|---|
| Number of observations | 6040 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 236.1 KiB |
| Average record size in memory | 40.0 B |
Variable types
| Numeric | 3 |
|---|---|
| Categorical | 2 |
Alerts
| Zip-code has a high cardinality: 3439 distinct values | High cardinality |
|---|---|
| Age is highly correlated with Occupation | High correlation |
| Occupation is highly correlated with Age | High correlation |
| UserID is uniformly distributed | Uniform |
| UserID has unique values | Unique |
| Occupation has 711 (11.8%) zeros | Zeros |
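Several of these flags come down to simple per-column counts. The sketch below is only an illustration of that idea, not pandas-profiling's own implementation: the helper name `column_flags` is hypothetical, and the sample values are the `UserID` and `Occupation` entries from the last ten rows shown in this report.

```python
def column_flags(values):
    """Return simple per-column counts of the kind the flags summarize."""
    n = len(values)
    distinct = len(set(values))
    zeros = sum(1 for v in values if v == 0)
    return {
        "distinct": distinct,
        "is_unique": distinct == n,  # every value occurs exactly once
        "zeros": zeros,
        "zeros_pct": round(100 * zeros / n, 1),
    }

# Values copied from the last ten rows of the report.
user_ids = [6031, 6032, 6033, 6034, 6035, 6036, 6037, 6038, 6039, 6040]
occupations = [0, 7, 13, 14, 1, 15, 1, 1, 0, 6]

user_flags = column_flags(user_ids)
occupation_flags = column_flags(occupations)

# Share of zeros reported for the full Occupation column: 711 of 6040 rows.
full_zero_pct = round(100 * 711 / 6040, 1)
```

On the full column, the same ratio reproduces the reported 11.8% of zeros in `Occupation`.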
Reproduction
| Analysis started | 2022-07-14 02:31:40.296363 |
|---|---|
| Analysis finished | 2022-07-14 02:33:13.463592 |
| Duration | 1 minute and 33.17 seconds |
| Software version | pandas-profiling v3.2.0 |
UserID
Numeric
| Distinct | 6040 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 3020.5 |
| Minimum | 1 |
| Maximum | 6040 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 47.3 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 302.95 |
| Q1 | 1510.75 |
| median | 3020.5 |
| Q3 | 4530.25 |
| 95-th percentile | 5738.05 |
| Maximum | 6040 |
| Range | 6039 |
| Interquartile range (IQR) | 3019.5 |
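The fractional percentiles above are consistent with linear interpolation between order statistics, which is the default in pandas and NumPy. The following is a minimal sketch of that rule under two assumptions: that `UserID` is exactly the integers 1 to 6040, and that the report used the default interpolation. The function name `quantile_linear` is hypothetical.

```python
import math

def quantile_linear(sorted_values, q):
    """Quantile by linear interpolation between the two nearest order statistics."""
    pos = (len(sorted_values) - 1) * q      # fractional 0-based position
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac

user_ids = list(range(1, 6041))  # assumption: UserID is the integers 1..6040

p05 = quantile_linear(user_ids, 0.05)
q1 = quantile_linear(user_ids, 0.25)
median = quantile_linear(user_ids, 0.50)
q3 = quantile_linear(user_ids, 0.75)
p95 = quantile_linear(user_ids, 0.95)
iqr = q3 - q1
```

For example, the 5-th percentile sits at position 0.05 × 6039 = 301.95, which falls between the values 302 and 303 and interpolates to 302.95.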
Descriptive statistics
| Standard deviation | 1743.742145 |
|---|---|
| Coefficient of variation (CV) | 0.5773024812 |
| Kurtosis | -1.2 |
| Mean | 3020.5 |
| Median Absolute Deviation (MAD) | 1510 |
| Skewness | 0 |
| Sum | 18243820 |
| Variance | 3040636.667 |
| Monotonicity | Strictly increasing |
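These figures can be cross-checked with the standard library alone. The sketch below assumes `UserID` is exactly the integers 1 to 6040, which is consistent with the reported minimum, maximum, distinct count and sum but is not stated outright in the report. It also assumes the variance is the sample variance (n − 1 denominator), since that is the version that matches the reported value.

```python
import math
import statistics

user_ids = list(range(1, 6041))  # assumption: UserID is the integers 1..6040

total = sum(user_ids)
mean = statistics.mean(user_ids)
variance = statistics.variance(user_ids)  # sample variance (n - 1 denominator)
std = math.sqrt(variance)
cv = std / mean                           # coefficient of variation
median = statistics.median(user_ids)
# Median absolute deviation: median of the distances to the median.
mad = statistics.median(abs(x - median) for x in user_ids)
```

Skewness and kurtosis are left out of this sketch because their values depend on the bias-correction convention in use.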
Histogram with fixed size bins (bins=50)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 1 | 1 | < 0.1% |
| 4024 | 1 | < 0.1% |
| 4033 | 1 | < 0.1% |
| 4032 | 1 | < 0.1% |
| 4031 | 1 | < 0.1% |
| 4030 | 1 | < 0.1% |
| 4029 | 1 | < 0.1% |
| 4028 | 1 | < 0.1% |
| 4027 | 1 | < 0.1% |
| 4026 | 1 | < 0.1% |
| Other values (6030) | 6030 | 99.8% |
Minimum values
| Value | Count | Frequency (%) |
|---|---|---|
| 1 | 1 | < 0.1% |
| 2 | 1 | < 0.1% |
| 3 | 1 | < 0.1% |
| 4 | 1 | < 0.1% |
| 5 | 1 | < 0.1% |
| 6 | 1 | < 0.1% |
| 7 | 1 | < 0.1% |
| 8 | 1 | < 0.1% |
| 9 | 1 | < 0.1% |
| 10 | 1 | < 0.1% |
Maximum values
| Value | Count | Frequency (%) |
|---|---|---|
| 6040 | 1 | < 0.1% |
| 6039 | 1 | < 0.1% |
| 6038 | 1 | < 0.1% |
| 6037 | 1 | < 0.1% |
| 6036 | 1 | < 0.1% |
| 6035 | 1 | < 0.1% |
| 6034 | 1 | < 0.1% |
| 6033 | 1 | < 0.1% |
| 6032 | 1 | < 0.1% |
| 6031 | 1 | < 0.1% |
Gender
Categorical
| Distinct | 2 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 47.3 KiB |

| M | 4331 |
|---|---|
| F | 1709 |
Length
| Max length | 1 |
|---|---|
| Median length | 1 |
| Mean length | 1 |
| Min length | 1 |
Characters and Unicode
| Total characters | 6040 |
|---|---|
| Distinct characters | 2 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
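As a small illustration of those character properties, Python's standard `unicodedata` module exposes the general category of each code point: `Lu` is Uppercase Letter, `Nd` is Decimal Number and `Pd` is Dash Punctuation. The sketch below is not how the report was produced. The helper name `category_counts` is hypothetical, the last ZIP value is invented to show a dash, and scripts and blocks are left out because `unicodedata` does not expose them.

```python
import unicodedata
from collections import Counter

def category_counts(values):
    """Count Unicode general categories over all characters in a list of strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts

gender_sample = ["F", "M", "M", "M", "M"]       # sample rows shown in this report
zip_sample = ["48067", "70072", "12345-6789"]   # last entry is a made-up ZIP+4 value

gender_categories = category_counts(gender_sample)
zip_categories = category_counts(zip_sample)
```

Here `gender_categories` contains only `Lu`, while `zip_categories` splits into `Nd` for the digits and `Pd` for the dash.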
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | F |
|---|---|
| 2nd row | M |
| 3rd row | M |
| 4th row | M |
| 5th row | M |
Common Values
| Value | Count | Frequency (%) |
|---|---|---|
| M | 4331 | 71.7% |
| F | 1709 | 28.3% |
Length
Histogram of lengths of the category
Category Frequency Plot
Most occurring characters
| Value | Count | Frequency (%) |
|---|---|---|
| M | 4331 | 71.7% |
| F | 1709 | 28.3% |
Most occurring categories
| Value | Count | Frequency (%) |
|---|---|---|
| Uppercase Letter | 6040 | 100.0% |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
|---|---|---|
| M | 4331 | 71.7% |
| F | 1709 | 28.3% |
Most occurring scripts
| Value | Count | Frequency (%) |
|---|---|---|
| Latin | 6040 | 100.0% |
Most frequent character per script
Latin
| Value | Count | Frequency (%) |
|---|---|---|
| M | 4331 | 71.7% |
| F | 1709 | 28.3% |
Most occurring blocks
| Value | Count | Frequency (%) |
|---|---|---|
| ASCII | 6040 | 100.0% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
|---|---|---|
| M | 4331 | 71.7% |
| F | 1709 | 28.3% |
Age
Numeric
| Distinct | 7 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 30.63923841 |
| Minimum | 1 |
| Maximum | 56 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 47.3 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 18 |
| Q1 | 25 |
| median | 25 |
| Q3 | 35 |
| 95-th percentile | 56 |
| Maximum | 56 |
| Range | 55 |
| Interquartile range (IQR) | 10 |
Descriptive statistics
| Standard deviation | 12.89596173 |
|---|---|
| Coefficient of variation (CV) | 0.4208969412 |
| Kurtosis | -0.2908100824 |
| Mean | 30.63923841 |
| Median Absolute Deviation (MAD) | 7 |
| Skewness | 0.2427000756 |
| Sum | 185061 |
| Variance | 166.3058289 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=7)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 25 | 2096 | 34.7% |
| 35 | 1193 | 19.8% |
| 18 | 1103 | 18.3% |
| 45 | 550 | 9.1% |
| 50 | 496 | 8.2% |
| 56 | 380 | 6.3% |
| 1 | 222 | 3.7% |
Minimum values
| Value | Count | Frequency (%) |
|---|---|---|
| 1 | 222 | 3.7% |
| 18 | 1103 | 18.3% |
| 25 | 2096 | 34.7% |
| 35 | 1193 | 19.8% |
| 45 | 550 | 9.1% |
| 50 | 496 | 8.2% |
| 56 | 380 | 6.3% |
Maximum values
| Value | Count | Frequency (%) |
|---|---|---|
| 56 | 380 | 6.3% |
| 50 | 496 | 8.2% |
| 45 | 550 | 9.1% |
| 35 | 1193 | 19.8% |
| 25 | 2096 | 34.7% |
| 18 | 1103 | 18.3% |
| 1 | 222 | 3.7% |
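Because Age takes only seven distinct values, its summary statistics can be rebuilt from the value counts alone. The sketch below does that with plain Python. It assumes the variance is the sample variance (n − 1 denominator), since that is the version that matches the reported figure; the variable names are illustrative.

```python
import math

# Age value -> count, copied from the frequency tables above.
age_counts = {1: 222, 18: 1103, 25: 2096, 35: 1193, 45: 550, 50: 496, 56: 380}

n = sum(age_counts.values())                         # number of observations
total = sum(value * count for value, count in age_counts.items())
mean = total / n
# Sample variance: weighted squared deviations divided by n - 1.
variance = sum(
    count * (value - mean) ** 2 for value, count in age_counts.items()
) / (n - 1)
std = math.sqrt(variance)
# Share of each value, as a percentage rounded to one decimal place.
frequency_pct = {
    value: round(100 * count / n, 1) for value, count in age_counts.items()
}
```

The counts add up to the 6040 observations and the weighted sum to the reported 185061, which is a quick way to confirm that the frequency table and the summary table describe the same column.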
Occupation
Numeric
| Distinct | 21 |
|---|---|
| Distinct (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 8.146854305 |
| Minimum | 0 |
| Maximum | 20 |
| Zeros | 711 |
| Zeros (%) | 11.8% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 47.3 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 3 |
| median | 7 |
| Q3 | 14 |
| 95-th percentile | 19 |
| Maximum | 20 |
| Range | 20 |
| Interquartile range (IQR) | 11 |
Descriptive statistics
| Standard deviation | 6.329511491 |
|---|---|
| Coefficient of variation (CV) | 0.7769270512 |
| Kurtosis | -1.21414437 |
| Mean | 8.146854305 |
| Median Absolute Deviation (MAD) | 5 |
| Skewness | 0.3382981095 |
| Sum | 49207 |
| Variance | 40.06271572 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=21)
Common values
| Value | Count | Frequency (%) |
|---|---|---|
| 4 | 759 | 12.6% |
| 0 | 711 | 11.8% |
| 7 | 679 | 11.2% |
| 1 | 528 | 8.7% |
| 17 | 502 | 8.3% |
| 12 | 388 | 6.4% |
| 14 | 302 | 5.0% |
| 20 | 281 | 4.7% |
| 2 | 267 | 4.4% |
| 16 | 241 | 4.0% |
| Other values (11) | 1382 | 22.9% |
Minimum values
| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 711 | 11.8% |
| 1 | 528 | 8.7% |
| 2 | 267 | 4.4% |
| 3 | 173 | 2.9% |
| 4 | 759 | 12.6% |
| 5 | 112 | 1.9% |
| 6 | 236 | 3.9% |
| 7 | 679 | 11.2% |
| 8 | 17 | 0.3% |
| 9 | 92 | 1.5% |
Maximum values
| Value | Count | Frequency (%) |
|---|---|---|
| 20 | 281 | 4.7% |
| 19 | 72 | 1.2% |
| 18 | 70 | 1.2% |
| 17 | 502 | 8.3% |
| 16 | 241 | 4.0% |
| 15 | 144 | 2.4% |
| 14 | 302 | 5.0% |
| 13 | 142 | 2.4% |
| 12 | 388 | 6.4% |
| 11 | 129 | 2.1% |
Zip-code
Categorical
| Distinct | 3439 |
|---|---|
| Distinct (%) | 56.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 47.3 KiB |

| 48104 | 19 |
|---|---|
| 22903 | 18 |
| 55104 | 17 |
| 94110 | 17 |
| 55455 | 16 |
| Other values (3434) | 5953 |
Length
| Max length | 10 |
|---|---|
| Median length | 5 |
| Mean length | 5.058112583 |
| Min length | 5 |
Characters and Unicode
| Total characters | 30551 |
|---|---|
| Distinct characters | 11 |
| Distinct categories | 2 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
Unique
| Unique | 2293 |
|---|---|
| Unique (%) | 38.0% |
Sample
| 1st row | 48067 |
|---|---|
| 2nd row | 70072 |
| 3rd row | 55117 |
| 4th row | 02460 |
| 5th row | 55455 |
Common Values
| Value | Count | Frequency (%) |
|---|---|---|
| 48104 | 19 | 0.3% |
| 22903 | 18 | 0.3% |
| 55104 | 17 | 0.3% |
| 94110 | 17 | 0.3% |
| 55455 | 16 | 0.3% |
| 55105 | 16 | 0.3% |
| 10025 | 16 | 0.3% |
| 94114 | 15 | 0.2% |
| 55408 | 15 | 0.2% |
| 02138 | 15 | 0.2% |
| Other values (3429) | 5876 | 97.3% |
Length
Histogram of lengths of the category
Most occurring characters
| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 5293 | 17.3% |
| 1 | 4140 | 13.6% |
| 2 | 3272 | 10.7% |
| 4 | 3119 | 10.2% |
| 5 | 3091 | 10.1% |
| 3 | 2663 | 8.7% |
| 9 | 2571 | 8.4% |
| 6 | 2229 | 7.3% |
| 7 | 2109 | 6.9% |
| 8 | 1998 | 6.5% |
Most occurring categories
| Value | Count | Frequency (%) |
|---|---|---|
| Decimal Number | 30485 | 99.8% |
| Dash Punctuation | 66 | 0.2% |
Most frequent character per category
Decimal Number
| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 5293 | 17.4% |
| 1 | 4140 | 13.6% |
| 2 | 3272 | 10.7% |
| 4 | 3119 | 10.2% |
| 5 | 3091 | 10.1% |
| 3 | 2663 | 8.7% |
| 9 | 2571 | 8.4% |
| 6 | 2229 | 7.3% |
| 7 | 2109 | 6.9% |
| 8 | 1998 | 6.6% |
Dash Punctuation
| Value | Count | Frequency (%) |
|---|---|---|
| - | 66 | 100.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
|---|---|---|
| Common | 30551 | 100.0% |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 5293 | 17.3% |
| 1 | 4140 | 13.6% |
| 2 | 3272 | 10.7% |
| 4 | 3119 | 10.2% |
| 5 | 3091 | 10.1% |
| 3 | 2663 | 8.7% |
| 9 | 2571 | 8.4% |
| 6 | 2229 | 7.3% |
| 7 | 2109 | 6.9% |
| 8 | 1998 | 6.5% |
Most occurring blocks
| Value | Count | Frequency (%) |
|---|---|---|
| ASCII | 30551 | 100.0% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 5293 | 17.3% |
| 1 | 4140 | 13.6% |
| 2 | 3272 | 10.7% |
| 4 | 3119 | 10.2% |
| 5 | 3091 | 10.1% |
| 3 | 2663 | 8.7% |
| 9 | 2571 | 8.4% |
| 6 | 2229 | 7.3% |
| 7 | 2109 | 6.9% |
| 8 | 1998 | 6.5% |
Correlations
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the phik project.
Missing values
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
First rows
| | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
| 5 | 6 | F | 50 | 9 | 55117 |
| 6 | 7 | M | 35 | 1 | 06810 |
| 7 | 8 | M | 25 | 12 | 11413 |
| 8 | 9 | M | 25 | 17 | 61614 |
| 9 | 10 | F | 35 | 1 | 95370 |
Last rows
| | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 6030 | 6031 | F | 18 | 0 | 45123 |
| 6031 | 6032 | M | 45 | 7 | 55108 |
| 6032 | 6033 | M | 50 | 13 | 78232 |
| 6033 | 6034 | M | 25 | 14 | 94117 |
| 6034 | 6035 | F | 25 | 1 | 78734 |
| 6035 | 6036 | F | 25 | 15 | 32603 |
| 6036 | 6037 | F | 45 | 1 | 76006 |
| 6037 | 6038 | F | 56 | 1 | 14706 |
| 6038 | 6039 | F | 45 | 0 | 01060 |
| 6039 | 6040 | M | 25 | 6 | 11106 |